July 19, 2020
Assume some background about GWAS.
Overview of the workflow:
Upload (e.g. your phenotype data)
Data Libraries (e.g. 3K geno data)
Other methods (not in this tutorial)
For this tutorial, we will use a subset of the 3,000 Rice Genomes dataset.
We randomly selected 500 samples, then filtered and LD-pruned the dataset.
The resulting dataset is 500 samples \( \times \) 629k SNPs.
Navigate to Data Libraries in the top menu
Select Test Dataset. Then select GEMMA
Create a new “History” named “myGWAS”.
If successful, your history will look like this:
LD pruning
Joining phenotype and genotype data
PCA (Population structure)
Kinship matrix
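As a preview of the kinship step: GEMMA's "centered" relatedness matrix amounts to a few lines of linear algebra. A minimal NumPy sketch on a toy genotype matrix (the data and variable names here are ours, not the tool's):

```python
import numpy as np

# Toy genotype matrix: n samples x p SNPs, coded 0/1/2 (allele counts).
rng = np.random.default_rng(0)
geno = rng.integers(0, 3, size=(6, 10)).astype(float)

# Centered kinship: center each SNP column, then K = Xc Xc^T / p.
centered = geno - geno.mean(axis=0)
p = geno.shape[1]
kinship = centered @ centered.T / p

# K is symmetric, with one row/column per sample.
assert kinship.shape == (6, 6)
assert np.allclose(kinship, kinship.T)
```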
Removing SNPs that are correlated ("in high LD") with other SNPs.
Why LD pruning?
| LD pruned | Filtered |
|---|---|
| Fewer SNPs | More SNPs |
| Faster GWAS | Slower GWAS |
| Lower peaks (miss top SNPs) | Higher peaks (more complete) |
| More correct QQ plot | Biased QQ plot |
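The idea behind LD pruning can be sketched in a few lines. This is a simplified greedy version; PLINK's `--indep-pairwise` additionally uses a sliding window and step size, and the toy data below are made up:

```python
import numpy as np

def ld_prune(geno, r2_max=0.6):
    """Greedy LD pruning sketch: keep a SNP only if its squared
    correlation with every previously kept SNP is below r2_max."""
    kept = []
    for j in range(geno.shape[1]):
        snp = geno[:, j]
        if all(np.corrcoef(snp, geno[:, k])[0, 1] ** 2 < r2_max for k in kept):
            kept.append(j)
    return kept

# Toy data: duplicate every SNP so each copy is in perfect LD (r^2 = 1)
# with its original -- the copies should all be pruned away.
rng = np.random.default_rng(1)
base = rng.integers(0, 3, size=(50, 5)).astype(float)
geno = np.hstack([base, base])
kept = ld_prune(geno, r2_max=0.6)
print(kept)  # only columns from the first, non-duplicated half survive
```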
Select PLINK tools then LD prune
Select appropriate datasets
Parameters:
Click “Execute”
Why do we need PCA?
We can use PCA in GWAS in two ways:
Find the tool Compute Principal Components within GEMMA tools section.
Select the datasets.
Click Execute
The history will be updated with 3 new files:
We will use the “PCA covariate file”.
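What the PCA tool computes can be sketched as follows. This is a minimal NumPy stand-in on a toy genotype matrix, not the Galaxy tool's actual code:

```python
import numpy as np

# Toy genotype matrix: n samples x p SNPs, coded 0/1/2.
rng = np.random.default_rng(2)
geno = rng.integers(0, 3, size=(30, 20)).astype(float)

# Standardize each SNP, then take the left singular vectors of the
# standardized matrix (equivalent to eigen-decomposing the covariance).
std = (geno - geno.mean(axis=0)) / geno.std(axis=0)
u, s, vt = np.linalg.svd(std, full_matrices=False)
pcs = u * s          # one row per sample; columns are PC1, PC2, ...

top3 = pcs[:, :3]    # e.g. keeping only the first 3 PCs as covariates
assert top3.shape == (30, 3)
```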
We will use the Visualization menu in the top navbar. This feature requires registration.
Search for Scatterplot, select “nvd3” scatterplot.
Set appropriate axis labels and select the data sources.
We can either use the PCA covariates file directly or, if we decide to use a smaller number of PCs, cut the first N+1 columns.
Let's use the first 3 PCs only.
The “Cut” tool allows specifying columns as “c1-c4”.
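A plain-Python stand-in for what "Cut" with columns "c1-c4" does (the example rows are made up; the first column here is assumed to be the covariate file's leading intercept column):

```python
# Keep the first n_cols whitespace-separated columns of each row --
# for 3 PCs that is c1-c4: the leading column plus PC1..PC3.
def cut_columns(lines, n_cols):
    return ["\t".join(line.split()[:n_cols]) for line in lines]

rows = ["1 0.01 -0.20 0.05 0.11 0.03",   # intercept, PC1..PC5 (made up)
        "1 0.02 0.15 -0.07 0.09 0.01"]
first3 = cut_columns(rows, 4)
print(first3[0])
```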
Tool: Add Pheno to FAM file
Select FAM from genotype dataset and a phenotype dataset
This creates 4 outputs (2 for each trait)
Wrong way
Right way: use a .fam file that contains the phenotype.
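What "Add Pheno to FAM file" does, sketched in plain Python (sample IDs and trait values below are made up):

```python
# A PLINK .fam file has 6 columns: FID IID PAT MAT SEX PHENO.
# Replace the 6th column with a trait value looked up by sample ID (IID).
def add_pheno(fam_lines, pheno_by_id, missing="-9"):
    out = []
    for line in fam_lines:
        fields = line.split()
        fields[5] = pheno_by_id.get(fields[1], missing)
        out.append(" ".join(fields))
    return out

fam = ["F1 s1 0 0 0 -9",
       "F1 s2 0 0 0 -9"]
pheno = {"s1": "12.3", "s2": "8.7"}  # sample ID -> trait value
print(add_pheno(fam, pheno))
```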
We will first do MLM-based GWAS.
Tool: Mixed Linear Model
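For reference, the model GEMMA fits is (in GEMMA's notation): \[ y = W\alpha + x\beta + u + \epsilon, \qquad u \sim \mathrm{MVN}_n(0,\, \lambda \tau^{-1} K), \qquad \epsilon \sim \mathrm{MVN}_n(0,\, \tau^{-1} I_n) \] where \( y \) is the vector of phenotypes, \( W \) the covariates (intercept plus, here, the PCs), \( x \) the genotypes of the tested SNP, and \( K \) the kinship matrix.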
Expected outputs:
Manhattan plot
QQ plot - check P-value inflation
Methods of multiple-testing correction:
Bonferroni: \[ \alpha_{Bonf} = \frac{\alpha}{\text{number of SNPs}} \] too conservative, since it assumes every SNP is an independent test.
\( -\log_{10}(\alpha_{Bonf}) = 6.6 \)
FDR (False Discovery Rate): behaves similarly to Bonferroni for the top SNPs.
Effective number of independent tests.
Not yet in Galaxy, but for this dataset the effective number of SNPs is about half the nominal number (~96k vs. ~200k).
In terms of the Manhattan plot threshold, \[ -\log_{10}(\alpha_{Gao}) = 6.2 \] instead of 6.6 (~7 for the original data).
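These thresholds follow directly from \( -\log_{10}(\alpha / n) \); a quick check with the approximate SNP counts quoted above (\( \alpha = 0.05 \)):

```python
import math

def mlog10_threshold(alpha, n_tests):
    """Manhattan-plot significance line: -log10(alpha / n_tests)."""
    return -math.log10(alpha / n_tests)

bonf = mlog10_threshold(0.05, 200_000)  # ~200k SNPs in the pruned data
gao = mlog10_threshold(0.05, 96_000)    # ~96k effective independent tests
print(round(bonf, 1), round(gao, 1))    # 6.6 and ~6.3 with these rounded counts
```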
Results for MLM with 3 PCs as covariates
This is the LD-pruned dataset (\( r^2 = 0.6 \)). Using the original data, the peak on chr3 is even higher (>14).
It is good to compare Manhattan plots from different models.
Almost no inflation is seen in most (~90%) of the p-values.
This means the correction based on relatedness works well.
Compare this to a linear model (i.e. not MLM, but GLM).
Compare basic LM and MLM
How do I find where peak starts and ends?
Why bother? Shouldn't I look only at the top SNP?
True or false: the causal SNP in a peak is the one with the most significant p-value?
Undetected variants
Missing calls and overlapping deletions: what is their effect on p-values?
Need to look at the whole region of strong LD with the significant SNPs.
LD-based clumping helps select SNP subsets that satisfy chosen p-value and LD (\( r^2 \)) thresholds.
It is a convenient way to delineate a haplotype block based on relevant SNPs.
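The clumping idea can be sketched as follows: take the most significant remaining SNP as a clump index, absorb every SNP in strong LD with it, and repeat. (PLINK's `--clump` works this way, with p-value, \( r^2 \) and distance thresholds; the SNP IDs and r² table below are made up.)

```python
def clump(snps, r2, r2_min=0.5):
    """snps: list of (snp_id, pvalue); r2: dict frozenset{a, b} -> r^2."""
    remaining = sorted(snps, key=lambda s: s[1])  # most significant first
    clumps = []
    while remaining:
        index, _ = remaining.pop(0)               # next clump index SNP
        members = [s for s in remaining
                   if r2.get(frozenset((index, s[0])), 0.0) >= r2_min]
        remaining = [s for s in remaining if s not in members]
        clumps.append((index, [m[0] for m in members]))
    return clumps

snps = [("rs1", 1e-8), ("rs2", 1e-7), ("rs3", 1e-3)]
r2 = {frozenset(("rs1", "rs2")): 0.9}
print(clump(snps, r2))  # rs2 joins the rs1 clump; rs3 is on its own
```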
“Verbose” Report
Ranges Report
Note that this shows the peak on chr3 seems to consist of 3 independent (not in strong LD), though overlapping, haplotype blocks.
Peak SNP annotation with effect (planned)
plink --annotate
Peak haplotypes analyses
HaploTool
subsequent post-GWAS data integration and analysis
(However, stay for Ramil Mauleon's talk!)